Introduction

In this section, we will focus on comparing the expression levels of genes across different samples. This analysis sets aside the task of estimating the different kinds of RNA molecules, and the different isoforms for genes with multiple isoforms. One advantage of looking at these matrices of counts is that we can use statistical distributions to model how the variance of counts will change when the counts are low vs high. We will explore the relationship of the variance of counts to the mean later in this lab.

Counting reads in genes

In this section we will examine 75 samples from Waldenstrom Macroglobulinemia patients, in a cohort called ???zhunter,??? from Harvard University.

Construct sample.table

setwd("/Users/yah2014/Dropbox/Public/Olivier/R/zhunter/Mutation");getwd() #set_up_enviroment
## [1] "/Users/yah2014/Dropbox/Public/Olivier/R/zhunter/Mutation"
sample_zhunter.table <- read.csv("sample_table_zhunter.csv")
fileName <-paste0(sample_zhunter.table$BioSample, ".bam.count")
sample_zhunter.table <- data.frame(sampleName = sample_zhunter.table$sampleName,
                                   fileName = fileName,
                                   Condition= sample_zhunter.table$Condition,
                                   QC = sample_zhunter.table$QC,
                                   Label= sample_zhunter.table$Label,
                                   BioSample = sample_zhunter.table$BioSample)
head(sample_zhunter.table)
##          sampleName                                fileName Condition   QC
## 1 ZH10_NWM07_CTTGTA ZH10_NWM07_CTTGTA_L005_R1_001.bam.count    Cancer PASS
## 2      ZH10_rnaWT10          ZH10_rnaWT10_R1_.tmp.bam.count   Healthy PASS
## 3 ZH11_NWM09_ACAGTG ZH11_NWM09_ACAGTG_L005_R1_001.bam.count    Cancer PASS
## 4      ZH11_rnaWT11          ZH11_rnaWT11_R1_.tmp.bam.count   Healthy PASS
## 5 ZH12_NWM10_GTGAAA ZH12_NWM10_GTGAAA_L006_R1_001.bam.count    Cancer PASS
## 6      ZH12_rnaWT12          ZH12_rnaWT12_R1_.tmp.bam.count   Healthy PASS
##   Label                     BioSample
## 1   NWM ZH10_NWM07_CTTGTA_L005_R1_001
## 2 rnaWT          ZH10_rnaWT10_R1_.tmp
## 3   NWM ZH11_NWM09_ACAGTG_L005_R1_001
## 4 rnaWT          ZH11_rnaWT11_R1_.tmp
## 5   NWM ZH12_NWM10_GTGAAA_L006_R1_001
## 6 rnaWT          ZH12_rnaWT12_R1_.tmp

Creating a DESeqDataSet object

invisible(library(DESeq2))
directory <- "/Users/yah2014/Dropbox/Public/Olivier/R/ALL_COUNTS/zhunter_Counts" #dir for HTSeq Counts
ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sample_zhunter.table,
                                       directory = directory,
                                       design= ~ Condition)
ddsHTSeq
## class: DESeqDataSet 
## dim: 23718 75 
## metadata(1): version
## assays(1): counts
## rownames(23718): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(75): ZH10_NWM07_CTTGTA ZH10_rnaWT10 ... ZH9_NWM06_GCCAAT
##   ZH9_rnaWT9
## colData names(4): Condition QC Label BioSample

if Error in colnames<-(*tmp*, value = 1:75) : attempt to set ‘colnames’ on an object with less than two dimensions clean the first row with the empty first column and “0” in the second column in HTSeq.count files use terminal, run below command lines to remove first line for all files: cd /Users/yah2014/Dropbox/Public/Olivier/R/ALL_COUNTS/zhunter_Counts for file in \((ls);do echo "\)(tail -n +2 $file)" > $file;done

Normalization for sequencing depth

Pre-filtering

removing rows in which there are no reads or nearly no reads

ddsHTSeq <- ddsHTSeq[ rowSums(counts(ddsHTSeq)) > 1, ]

Note on factor levels

ddsHTSeq$Condition <- factor(ddsHTSeq$Condition, levels=c("Cancer","Healthy","Unclear"))
ddsHTSeq$Label <- factor(ddsHTSeq$Label, levels=c("NWM","rnaWT","WM","MWCL","BCWM"))

The following estimates size factors to account for differences in sequencing depth, and is only necessary to make the log.norm.counts object below.

dds <- estimateSizeFactors(ddsHTSeq)
head(sizeFactors(dds))
## ZH10_NWM07_CTTGTA      ZH10_rnaWT10 ZH11_NWM09_ACAGTG      ZH11_rnaWT11 
##         1.1600472         0.5454389         1.5941689         0.4574318 
## ZH12_NWM10_GTGAAA      ZH12_rnaWT12 
##         2.0997319         0.2261451
head(colSums(counts(dds)))
## ZH10_NWM07_CTTGTA      ZH10_rnaWT10 ZH11_NWM09_ACAGTG      ZH11_rnaWT11 
##          46912880          17611380          45321999          15580851 
## ZH12_NWM10_GTGAAA      ZH12_rnaWT12 
##          58828636           6868649
plot(sizeFactors(dds), colSums(counts(dds)))
abline(lm(colSums(counts(dds)) ~ sizeFactors(dds) + 0))

Size factors are calculated by the median ratio of samples to a pseudo-sample (the geometric mean of all samples). In other words, for each sample, we take the exponent of the median of the log ratios in this histogram.

loggeomeans <- rowMeans(log(counts(dds)))
hist(log(counts(dds)[,1]) - loggeomeans, 
     col="grey", main="", xlab="", breaks=40)

The size factor for the first sample:

exp(median((log(counts(dds)[,1]) - loggeomeans)[is.finite(loggeomeans)]))
## [1] 1.160047
sizeFactors(dds)[1]
## ZH10_NWM07_CTTGTA 
##          1.160047

Make a matrix of log normalized counts (plus a pseudocount):

log.norm.counts <- log2(counts(dds, normalized=TRUE) + 1)

Examine the log normalized counts (plus a pseudocount).

rs <- rowSums(counts(dds))
boxplot(log.norm.counts[rs > 0,]) # normalized

Make a scatterplot of log normalized counts against each other. Note the fanning out of the points in the lower left corner, for points less than \(2^5 = 32\).

plot(log.norm.counts[,1:2], cex=.1)

Stabilizing count variance

Now we will use a more sophisticated transformation, which is similar to the variance stablizing normalization method taught in Week 3 of Course 4: Introduction to Bioconductor. It uses the variance model for count data to shrink together the log-transformed counts for genes with very low counts. For genes with medium and high counts, the rlog is very close to log2. For further details, see the section in the DESeq2 paper.
This will take five minutes!!

#rld <- rlog(dds)
#plot(assay(rld)[,1:2], cex=.1)

Another transformation for stabilizing variance in the DESeq2 package is varianceStabilizingTransformation. These two tranformations are similar, the rlog might perform a bit better when the size factors vary widely, and the varianceStabilizingTransformation is much faster when there are many samples.

vsd <- varianceStabilizingTransformation(dds)
plot(assay(vsd)[,1:2], cex=.1)

We can examine the standard deviation of rows over the mean for the vsd. Note that the genes with high variance for the log come from the genes with lowest mean. If these genes were included in a distance calculation, the high variance at the low count range might overwhelm the signal at the higher count range.

library(vsn)
meanSdPlot(assay(vsd), ranks=FALSE)

The principal components (PCA) plot is a useful diagnostic for examining relationships between samples:

Using the VST:

plotPCA(vsd, intgroup="Condition")

We can make this plot even nicer using custom code from the ggplot2 library:

library(ggplot2)
head((data <- plotPCA(vsd, intgroup=c("Condition","Label"), returnData=TRUE)))
##                          PC1        PC2           group Condition Label
## ZH10_NWM07_CTTGTA  -4.403314  13.418230    Cancer : NWM    Cancer   NWM
## ZH10_rnaWT10      -17.049619  -9.921083 Healthy : rnaWT   Healthy rnaWT
## ZH11_NWM09_ACAGTG   8.591741   8.930622    Cancer : NWM    Cancer   NWM
## ZH11_rnaWT11       17.688041 -27.323018 Healthy : rnaWT   Healthy rnaWT
## ZH12_NWM10_GTGAAA  19.544231  -9.537662    Cancer : NWM    Cancer   NWM
## ZH12_rnaWT12        1.150560 -19.160230 Healthy : rnaWT   Healthy rnaWT
##                                name
## ZH10_NWM07_CTTGTA ZH10_NWM07_CTTGTA
## ZH10_rnaWT10           ZH10_rnaWT10
## ZH11_NWM09_ACAGTG ZH11_NWM09_ACAGTG
## ZH11_rnaWT11           ZH11_rnaWT11
## ZH12_NWM10_GTGAAA ZH12_NWM10_GTGAAA
## ZH12_rnaWT12           ZH12_rnaWT12
(percentVar <- 100*round(attr(data, "percentVar"),2))
## [1] 24 15
makeLab <- function(x,pc) paste0("PC",pc,": ",x,"% variance")
ggplot(data, aes(PC1,PC2,col=Condition,shape=Label)) + geom_point() +
  xlab(makeLab(percentVar[1],1)) + ylab(makeLab(percentVar[2],2))

In addition, we can plot a hierarchical clustering based on Euclidean distance matrix:

plot(hclust(dist(t(log.norm.counts))), labels=colData(dds)$Condition,cex = 0.75)

plot(hclust(dist(t(assay(vsd)))), labels=colData(vsd)$Condition,cex = 0.75)

Differential gene expression

Modeling raw counts with normalization

We will now perform differential gene expression on the counts, to try to find genes in which the differences in expected counts across samples due to the condition of interest rises above the biological and technical variance we observe.

We will use an overdispersed Poisson distribution – called the negative binomial – to model the raw counts in the count matrix. The model will include the size factors into account to adjust for sequencing depth. The formula will look like:

\[ K_{ij} \sim \text{NB}(s_{ij} q_{ij}, \alpha_i ) \]

where \(K_{ij}\) is a single raw count in our count table, \(s_{ij}\) is a size factor or more generally a normalization factor, \(q_{ij}\) is proportional to gene expression (what we want to model with our design variables), and \(\alpha_i\) is a dispersion parameter.

Why bother modeling raw counts, rather than dividing out the sequencing depth and working with the normalized counts? In other words, why put the \(s_{ij}\) on the right side of the equation above, rather than dividing out on the left side and modeling \(K_{ij} / s_{ij}\). The reason is that, with the raw count, we have knowledge about the link between the expected value and its variance. So we prefer the first equation below to the second equation, because with the first equation, we have some additional information about the variance of the quantity on the left hand side.

\[ K_{ij} \sim \text{NB}(\mu_{ij} = s_{ij} q_{ij} ) \]

\[ \frac{K_{ij}}{s_{ij}} \sim \mathcal{L}(\mu_{ij} = q_{ij}) \]

When we sample cDNA fragments from a pool in a sequencing library, we can model the count of cDNA fragments which originated from a given gene with a binomial distribution, with a certain probability of picking a fragment for that gene which relates to factors such as the expression of that gene (the abundance of mRNA in the original population of cells), its length and technical factors in the production of the library. When we have many genes, and the rate for each gene is low, while the total number of fragments is high, we know that the Poisson is a good model for the binomial. And for the binomial and the Poisson, there is an explicit link between on observed count and its expected variance.

A number of methods for assessing differential gene expression from RNA-seq counts use the negative binomial distribution to make probabilistic statements about the differences seen in an experiment. A few such methods are edgeR, DESeq2, and DSS. Other methods, such as limma+voom find other ways to explicitly model the mean of log counts and the observed variance of log counts. A very incomplete list of statistical methods for RNA-seq differential expression is provided in the footnotes.

DESeq2 performs a similar step to limma as discussed in PH525x Course 3, in using the variance of all the genes to improve the variance estimate for each individual gene. In addition, DESeq2 shrinks the unreliable fold changes from genes with low counts, which will be seen in the resulting MA-plot.

Experimental design and running DESeq2

Remember, we had created the DESeqDataSet object earlier using the following line of code (or alternatively using DESeqDataSetFromMatrix)

dds <- DESeqDataSet(ddsHTSeq, design= ~ Condition)

First, we setup the design of the experiment, so that differences will be considered across time and protocol variables. We can read and if necessary reset the design using the following code.

design(dds)
## ~Condition
design(dds) <- ~Condition

The last variable in the design is used by default for building results tables (although arguments to results can be used to customize the results table), and we make sure the “control” or “untreated” level is the first level, such that log fold changes will be treated over control, and not control over treated.

levels(dds$Condition)
## [1] "Cancer"  "Healthy" "Unclear"
dds$Condition <- relevel(dds$Condition, "Unclear")
dds$Condition <- relevel(dds$Condition, "Healthy")
levels(dds$Condition)
## [1] "Healthy" "Unclear" "Cancer"

The following line runs the DESeq2 model. After this step, we can build a results table, which by default will compare the levels in the last variable in the design, so the Condition treatment in our case:

dds <- DESeq(dds)
res <- results(dds)

Examining results tables

head(res)
## log2 fold change (MAP): Condition Cancer vs Healthy 
## Wald test p-value: Condition Cancer vs Healthy 
## DataFrame with 6 rows and 6 columns
##            baseMean log2FoldChange     lfcSE      stat       pvalue
##           <numeric>      <numeric> <numeric> <numeric>    <numeric>
## A1BG     310.314358     -1.1415070 0.2167890 -5.265521 1.397924e-07
## A1BG-AS1  34.601285     -0.4264891 0.2095279 -2.035477 4.180294e-02
## A1CF      23.336867     -2.2869681 0.5472728 -4.178845 2.929937e-05
## A2M      600.882478     -1.2885536 0.4425605 -2.911588 3.595970e-03
## A2M-AS1    6.710175     -1.3377740 0.4963390 -2.695283 7.032896e-03
## A2ML1      8.768966     -4.3359890 0.6396153 -6.779058 1.209618e-11
##                  padj
##             <numeric>
## A1BG     7.363644e-07
## A1BG-AS1 6.472283e-02
## A1CF     9.823723e-05
## A2M      7.400281e-03
## A2M-AS1  1.343151e-02
## A2ML1    1.181273e-10
table(res$padj < 0.1)
## 
## FALSE  TRUE 
##  6958 15171

A summary of the results can be generated:

summary(res)
## 
## out of 23049 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)     : 4329, 19% 
## LFC < 0 (down)   : 10842, 47% 
## outliers [1]     : 920, 4% 
## low counts [2]   : 1, 0.0043% 
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

For testing at a different threshold, we provide the alpha to results, so that the mean filtering is optimal for our new FDR threshold.

res2 <- results(dds, alpha=0.05)
table(res2$padj < 0.05)
## 
## FALSE  TRUE 
##  8344 13785

Exporting results to CSV files

resO5rdered <- res2[order(res2$padj),]
write.csv(as.data.frame(resO5rdered), 
          file="condition_treated_results.csv")

Visualizing results

The MA-plot provides a global view of the differential genes, with the log2 fold change on the y-axis over the mean of normalized counts:

plotMA(res, ylim=c(-8,4))

We can also test against a different null hypothesis. For example, to test for genes which have fold change more than doubling or less than halving:

res.thr <- results(dds, lfcThreshold=1)
plotMA(res.thr, ylim=c(-8,4))

A p-value histogram:

hist(res$pvalue[res$baseMean > 1], 
     col="grey", border="white", xlab="", ylab="", main="")

A sorted results table:

resSort <- res[order(res$padj),]
head(resSort)
## log2 fold change (MAP): Condition Cancer vs Healthy 
## Wald test p-value: Condition Cancer vs Healthy 
## DataFrame with 6 rows and 6 columns
##            baseMean log2FoldChange     lfcSE      stat        pvalue
##           <numeric>      <numeric> <numeric> <numeric>     <numeric>
## POTEE     204.72678      -5.884769 0.2021133 -29.11619 2.239816e-186
## LINC00273 244.37937      -8.109809 0.3464320 -23.40953 3.417835e-121
## MAP1B     580.26255      -5.575665 0.2446174 -22.79341 5.329831e-115
## PDF        87.62892      -3.807874 0.1684273 -22.60842 3.581484e-113
## NUDT3     215.98888      -3.806948 0.1719760 -22.13651 1.407210e-108
## ENDOG     133.27318      -3.452335 0.1638058 -21.07578  1.327090e-98
##                    padj
##               <numeric>
## POTEE     4.956490e-182
## LINC00273 3.781664e-117
## MAP1B     3.931461e-111
## PDF       1.981366e-109
## NUDT3     6.228032e-105
## ENDOG      4.894529e-95

Examine the counts for the top gene, sorting by p-value:

plotCounts(dds, gene=which.min(res$padj), intgroup="Condition")

Make normalized counts plots for the top 9 genes:

par(mfrow=c(3,3))
for (i in 1:9)  plotCounts(dds, order(res$padj)[i], intgroup="Condition")

A more sophisticated plot of counts:

library(ggplot2)
data <- plotCounts(dds, gene=which.min(res$padj), intgroup=c("Condition","Label"), returnData=TRUE)
ggplot(data, aes(x=Condition, y=count, col=Label)) +
  geom_point(position=position_jitter(width=.1,height=0)) +
  scale_y_log10()

Connecting by lines shows the differences which are actually being tested by results given that our design includes cell + Condition

par(mfrow=c(1,1))
ggplot(data, aes(x=Condition, y=count, col=Label, group=Label)) +
  geom_point() + geom_line() + scale_y_log10() 

A heatmap of the top genes:

library(pheatmap)
topgenes <- head(rownames(resSort),20)
mat <- assay(vsd)[topgenes,]
mat <- mat - rowMeans(mat)
df <- as.data.frame(colData(dds)[,c("Condition","Label")])
pheatmap(mat, annotation_col=df, fontsize_col=6)

Getting alternate annotations

We can then check the annotation of these highly significant genes:

#library(org.Hs.eg.db)
#keytypes(org.Hs.eg.db)
#anno <- select(org.Hs.eg.db, keys=topgenes,
#               columns=c("SYMBOL","GENENAME"), 
#               keytype="ENSEMBL")
#anno[match(topgenes, anno$ENSEMBL),]
#for Bioconductor >= 3.1, easier to use mapIds() function

Looking up different results tables

The contrast argument allows users to specify what results table should be built. See the help and examples in ?results for more details:

#results(dds, contrast=c("cell","N61311","N052611"))

Surrogate variable analysis for RNA-seq

If we suppose that we didn’t know about the different cell-lines in the experiment, but noticed some structure in the counts, we could use surrograte variable analysis (SVA) to detect this hidden structure (see PH525x Course 3 for details on the algorithm).

library(sva)
## Loading required package: mgcv
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:IRanges':
## 
##     collapse
## This is mgcv 1.8-17. For overview type 'help("mgcv-package")'.
## Loading required package: genefilter
dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]
mod <- model.matrix(~ Condition, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv=2)
## Number of significant surrogate variables is:  2 
## Iteration (out of 5 ):1  2  3  4  5

Do the surrogate variables capture the cell difference?

par(mfrow=c(1,1))
plot(svseq$sv[,1], svseq$sv[,2], col=dds$Label,pch=16)
legend("topright",pch = 16, col=1:5,levels(dds$Label))

#text(svseq$sv[,1], svseq$sv[,2], 1:ncol(dds), pos=1)

Do the surrogate variables capture the health condition?

plot(svseq$sv[,1], svseq$sv[,2], col=dds$Condition,pch=16)
legend("topright",pch = 16, col=1:3,levels(dds$Condition))

#text(svseq$sv[,1], svseq$sv[,2], 1:ncol(dds), pos=1)

Using the surrogate variables in a DESeq2 analysis:

#dds.sva <- dds
#dds.sva$SV1 <- svseq$sv[,1]
#dds.sva$SV2 <- svseq$sv[,2]
#design(dds.sva) <- ~ SV1 + SV2 + Condition
#dds.sva <- DESeq(dds.sva)

Session info

sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X Yosemite 10.10.5
## 
## locale:
## [1] C
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] sva_3.22.0                 genefilter_1.56.0         
##  [3] mgcv_1.8-17                nlme_3.1-131              
##  [5] pheatmap_1.0.8             ggplot2_2.2.1             
##  [7] hexbin_1.27.1              vsn_3.42.3                
##  [9] DESeq2_1.14.1              SummarizedExperiment_1.4.0
## [11] Biobase_2.34.0             GenomicRanges_1.26.4      
## [13] GenomeInfoDb_1.10.3        IRanges_2.8.2             
## [15] S4Vectors_0.12.2           BiocGenerics_0.20.0       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.10          locfit_1.5-9.1        lattice_0.20-35      
##  [4] rprojroot_1.2         digest_0.6.12         plyr_1.8.4           
##  [7] backports_1.0.5       acepack_1.4.1         RSQLite_1.1-2        
## [10] evaluate_0.10         BiocInstaller_1.24.0  zlibbioc_1.20.0      
## [13] lazyeval_0.2.0        data.table_1.10.4     annotate_1.52.1      
## [16] rpart_4.1-11          Matrix_1.2-10         checkmate_1.8.2      
## [19] preprocessCore_1.36.0 rmarkdown_1.5         labeling_0.3         
## [22] splines_3.3.3         BiocParallel_1.8.2    geneplotter_1.52.0   
## [25] stringr_1.2.0         foreign_0.8-68        htmlwidgets_0.8      
## [28] RCurl_1.95-4.8        munsell_0.4.3         base64enc_0.1-3      
## [31] htmltools_0.3.6       nnet_7.3-12           tibble_1.3.0         
## [34] gridExtra_2.2.1       htmlTable_1.9         Hmisc_4.0-2          
## [37] XML_3.98-1.6          bitops_1.0-6          grid_3.3.3           
## [40] xtable_1.8-2          gtable_0.2.0          affy_1.52.0          
## [43] DBI_0.6-1             magrittr_1.5          scales_0.4.1         
## [46] stringi_1.1.5         XVector_0.14.1        affyio_1.44.0        
## [49] limma_3.30.13         latticeExtra_0.6-28   Formula_1.2-1        
## [52] RColorBrewer_1.1-2    tools_3.3.3           survival_2.41-3      
## [55] yaml_2.1.14           AnnotationDbi_1.36.2  colorspace_1.3-2     
## [58] cluster_2.0.6         memoise_1.1.0         knitr_1.15.1